Haplotype assembly in polyploid genomes and identical by descent shared tracts
Abstract
MOTIVATION: Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (i) do not consider individuals sharing haplotypes jointly, which reduces the size and accuracy of assembled haplotypes, and (ii) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). Polyploid organisms are increasingly becoming the target of many research groups interested in the genomics of disease, phylogenetics, botany and evolution, but there is an absence of theory and methods for polyploid haplotype reconstruction.

RESULTS: In this work, we present a number of results, extensions and generalizations of compass graphs and our HapCompass framework. We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. Furthermore, we present graph theory-based algorithms for the problem of haplotype assembly using our previously developed HapCompass framework for (i) novel implementations of haplotype assembly optimizations (minimum error correction), (ii) assembly of a pair of individuals sharing a haplotype tract identical by descent and (iii) assembly of polyploid genomes. We evaluate our methods on 1000 Genomes Project, Pacific Biosciences and simulated sequence data.

AVAILABILITY AND IMPLEMENTATION: HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Similar resources
HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data
As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet com...
Tumor Haplotype Assembly Algorithms for Cancer Genomics
The growing availability of inexpensive high-throughput sequence data is enabling researchers to sequence tumor populations within a single individual at high coverage. But, cancer genome sequence evolution and mutational phenomena like driver mutations and gene fusions are difficult to investigate without first reconstructing tumor haplotype sequences. Haplotype assembly of single individual t...
De novo identification of “heterotigs” towards accurate and in-phase assembly of complex plant genomes
Accurate and in-phase de novo assembly of highly polymorphic diploid and polyploid plant genomes remains a critical yet unsolved problem. “Out-of-the-box” assemblies on such data can produce numerous small contigs, at lower than expected coverage, which are hypothesized to represent sequences that are not uniformly present on all copies of a homologous set of chromosomes. Such “heterotigs” are ...
The haplotype-resolved genome sequence of hexaploid Ipomoea batatas reveals its evolutionary history
Although the sweet potato, Ipomoea batatas, is the seventh most important crop in the world and the fourth most significant in China, its genome has not yet been sequenced. The reason, at least in part, is that the genome has proven very difficult to assemble, being hexaploid and highly polymorphic; it has a presumptive composition of two B1 and four B2 component genomes (B1B1B2B2B2B2). By usin...
Inferring Demographic History from a Spectrum of Shared Haplotype Lengths
There has been much recent excitement about the use of genetics to elucidate ancestral history and demography. Whole genome data from humans and other species are revealing complex stories of divergence and admixture that were left undiscovered by previous smaller data sets. A central challenge is to estimate the timing of past admixture and divergence events, for example the time at which Nean...